[addon-operator] add queue head info metric and critical flag to module info#771

Draft
diyliv wants to merge 2 commits into main from feature/queue-head-info-metric

diyliv commented Jun 2, 2026

What this PR does

Adds a new metric and a new label on an existing metric. Together they let us replace the flat D8DeckhouseQueueIsHung alert with severity-differentiated alerts.

New metric: tasks_queue_head_info

A gauge (value 1) with the labels queue, module, task_type, and hook. It is published every 5 seconds for each non-empty queue. Old series are expired when the head changes, so no phantom metrics remain.
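
The expire-then-republish behaviour can be sketched with a plain in-memory map. This is only an illustration: `gaugeGroup`, `headInfo`, and `Republish` are hypothetical names, not the addon-operator metric storage API, and the label values are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// headInfo holds the labels of one tasks_queue_head_info series.
type headInfo struct {
	Queue, Module, TaskType, Hook string
}

// gaugeGroup is a hypothetical in-memory stand-in for a grouped metric
// storage. It is not the addon-operator API.
type gaugeGroup struct {
	series map[headInfo]float64
}

// Republish drops every series in the group, then sets one series with
// value 1 per current queue head, so a stale head cannot linger.
func (g *gaugeGroup) Republish(heads []headInfo) {
	g.series = make(map[headInfo]float64, len(heads))
	for _, h := range heads {
		g.series[h] = 1
	}
}

// Queues returns the sorted names of queues that currently have a series.
func (g *gaugeGroup) Queues() []string {
	names := make([]string, 0, len(g.series))
	for h := range g.series {
		names = append(names, h.Queue)
	}
	sort.Strings(names)
	return names
}

func main() {
	g := &gaugeGroup{}
	// First tick: module "a" is at the head of queue "main" (illustrative values).
	g.Republish([]headInfo{{Queue: "main", Module: "a", TaskType: "ModuleRun"}})
	// Next tick: the head changed to module "b"; the series for "a" is gone.
	g.Republish([]headInfo{{Queue: "main", Module: "b", TaskType: "ModuleRun"}})
	fmt.Println(len(g.series), g.Queues())
}
```

Clearing the whole group on every tick trades a little extra work for not having to track which label set each queue published last time.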

Label cleanup:

  • ParallelModuleRun synthetic names like "Parallel run for a, b, c" are normalized to an empty string; otherwise they would produce a bad join with deckhouse_mm_module_info
  • Global tasks (ConvergeModules, GlobalHookRun, DiscoverHelmReleases, ApplyKubeConfigValues) have an empty module label, which is correct because these tasks are not module-specific
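
The normalization rule above amounts to a prefix check. A minimal sketch, assuming the prefix is exactly "Parallel run for " as in the example name; `normalizeModuleLabel` is a hypothetical helper name used only for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// parallelRunPrefix is the synthetic-name prefix quoted in the PR description.
const parallelRunPrefix = "Parallel run for "

// normalizeModuleLabel (hypothetical name) maps synthetic ParallelModuleRun
// names to an empty label and leaves every other value unchanged.
func normalizeModuleLabel(module string) string {
	if strings.HasPrefix(module, parallelRunPrefix) {
		return ""
	}
	return module
}

func main() {
	for _, m := range []string{"Parallel run for a, b, c", "", "some-module"} {
		fmt.Printf("%q -> %q\n", m, normalizeModuleLabel(m))
	}
}
```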

New label: critical on deckhouse_mm_module_info

The value is "true" or "false", taken from BasicModule.GetCritical() (the critical: true property in module.yaml). The label is additive, so existing queries are unaffected.

Why it's needed

The old D8DeckhouseQueueIsHung alert had two problems:

  • No way to see what is stuck: only the queue name was visible, not the module, task type, or hook
  • Same severity for everything: all hung queues alerted at severity 7, regardless of how critical the module was

With these two metrics, we can create three separate alerts:

| Alert | Severity | Triggers for |
|---|---|---|
| D8DeckhouseQueueIsHungCritical | 4 | critical="true" modules |
| D8DeckhouseQueueIsHung | 6 | critical="false" modules |
| D8DeckhouseQueueIsHungGlobal | 4 | global tasks (module="") |
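
The routing in the table can be restated as a small Go function. The real alerts would be Prometheus rules, which this PR does not show; `alert` and `classifyHungQueue` are hypothetical names, and the sketch assumes the empty-module check takes precedence over the critical flag.

```go
package main

import "fmt"

// alert pairs an alert name with its severity, as listed in the table.
type alert struct {
	Name     string
	Severity int
}

// classifyHungQueue (hypothetical name) picks the alert for a hung queue
// from the module label of its head task and that module's critical flag.
// Assumption: an empty module label is checked first and means a global task.
func classifyHungQueue(module string, critical bool) alert {
	switch {
	case module == "":
		return alert{Name: "D8DeckhouseQueueIsHungGlobal", Severity: 4}
	case critical:
		return alert{Name: "D8DeckhouseQueueIsHungCritical", Severity: 4}
	default:
		return alert{Name: "D8DeckhouseQueueIsHung", Severity: 6}
	}
}

func main() {
	// Module names here are placeholders, not real Deckhouse modules.
	fmt.Println(classifyHungQueue("", false))
	fmt.Println(classifyHungQueue("module-a", true))
	fmt.Println(classifyHungQueue("module-b", false))
}
```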

diyliv self-assigned this Jun 2, 2026
diyliv marked this pull request as draft June 2, 2026 17:03
diyliv changed the title from "add queue head info metric and critical flag to module info" to "[addon-operator] add queue head info metric and critical flag to module info" Jun 2, 2026
diyliv force-pushed the feature/queue-head-info-metric branch from 2b91642 to 14ed834 June 2, 2026 17:11
diyliv added 2 commits June 3, 2026 16:58
Signed-off-by: diyliv <onlogn081@gmail.com>
Signed-off-by: diyliv <onlogn081@gmail.com>
diyliv force-pushed the feature/queue-head-info-metric branch from 14ed834 to 580232f June 3, 2026 13:59
diyliv added the release-note/enhancement (New feature or request) and publish/image/dev (Build and push dev image using PR number as docker tag) labels Jun 3, 2026
github-actions bot removed the publish/image/dev (Build and push dev image using PR number as docker tag) label Jun 3, 2026
ldmonster requested a review from Copilot June 5, 2026 13:31
Copilot AI left a comment

Pull request overview

This PR enhances addon-operator observability for “hung queue” alerting by adding a new “queue head” metric (to show what’s actually stuck) and extending the module info metric with a critical label (to enable severity-differentiated alerts based on module criticality).

Changes:

  • Add tasks_queue_head_info gauge metric (published every 5s for non-empty queues) with labels: queue, module, task_type, hook, expiring old series when the head changes.
  • Extend deckhouse_mm_module_info (mm_module_info) metric with an additive critical={"true"|"false"} label derived from BasicModule.GetCritical().
  • Wire the new queue-head extraction into bootstrap and add unit tests for head-info publication/expiration behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| pkg/module_manager/module_manager.go | Adds critical label to module info metric series. |
| pkg/metrics/metrics.go | Introduces tasks_queue_head_info metric and publishes it alongside queue length updates. |
| pkg/metrics/metrics_test.go | Adds tests for queue head info metric creation, normalization, and expiration. |
| pkg/addon-operator/bootstrap.go | Provides a metadata extractor for deriving (module, hook) for the new head-info metric. |

Comment on lines +514 to +517
	critical := "false"
	if bm := mm.GetModule(module); bm != nil && bm.GetCritical() {
		critical = "true"
	}
Comment thread pkg/metrics/metrics.go
Comment on lines +618 to +639
func updateTasksQueueHeadInfo(tqs *queue.TaskQueueSet, metricStorage metricsstorage.Storage, headInfoExtractor func(metadata interface{}) (module, hook string)) {
	metricStorage.Grouped().ExpireGroupMetricByName("tasks_queue_head_info", TasksQueueHeadInfo)

	tqs.IterateSnapshot(context.TODO(), func(_ context.Context, q *queue.TaskQueue) {
		t := q.GetFirst()
		if t == nil {
			return
		}

		module, hook := headInfoExtractor(t.GetMetadata())

		// Normalize ParallelModuleRun synthetic module names:
		// "Parallel run for a, b, c" -> "" to avoid false joins with deckhouse_mm_module_info.
		if strings.HasPrefix(module, "Parallel run for ") {
			module = ""
		}

		metricStorage.Grouped().GaugeSet(
			"tasks_queue_head_info",
			TasksQueueHeadInfo,
			1,
			map[string]string{